A day with (the) Julia (language)

Data analysis

JuliaStats: Statistics and Machine Learning made easy in Julia.


In [29]:
# Pkg.add("DataFrames")

In [30]:
using DataFrames # DataFrames to represent tabular datasets
                 # Database-style joins and indexing
                 # Split-apply-combine operations, reshape and pivoting
                 # Formula and model frames

In [31]:
run(`head data/iris.csv`)


"SepalLength","SepalWidth","PetalLength","PetalWidth","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3.0,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5.0,3.6,1.4,0.2,"setosa"
5.4,3.9,1.7,0.4,"setosa"
4.6,3.4,1.4,0.3,"setosa"
5.0,3.4,1.5,0.2,"setosa"
4.4,2.9,1.4,0.2,"setosa"

In [32]:
iris = readtable("data/iris.csv")


Out[32]:
Row  SepalLength  SepalWidth  PetalLength  PetalWidth  Species
1    5.1          3.5         1.4          0.2         setosa
2    4.9          3.0         1.4          0.2         setosa
3    4.7          3.2         1.3          0.2         setosa
4    4.6          3.1         1.5          0.2         setosa
5    5.0          3.6         1.4          0.2         setosa
6    5.4          3.9         1.7          0.4         setosa
7    4.6          3.4         1.4          0.3         setosa
8    5.0          3.4         1.5          0.2         setosa
9    4.4          2.9         1.4          0.2         setosa
10   4.9          3.1         1.5          0.1         setosa
11   5.4          3.7         1.5          0.2         setosa
12   4.8          3.4         1.6          0.2         setosa
13   4.8          3.0         1.4          0.1         setosa
14   4.3          3.0         1.1          0.1         setosa
15   5.8          4.0         1.2          0.2         setosa
16   5.7          4.4         1.5          0.4         setosa
17   5.4          3.9         1.3          0.4         setosa
18   5.1          3.5         1.4          0.3         setosa
19   5.7          3.8         1.7          0.3         setosa
20   5.1          3.8         1.5          0.3         setosa
21   5.4          3.4         1.7          0.2         setosa
22   5.1          3.7         1.5          0.4         setosa
23   4.6          3.6         1.0          0.2         setosa
24   5.1          3.3         1.7          0.5         setosa
25   4.8          3.4         1.9          0.2         setosa
26   5.0          3.0         1.6          0.2         setosa
27   5.0          3.4         1.6          0.4         setosa
28   5.2          3.5         1.5          0.2         setosa
29   5.2          3.4         1.4          0.2         setosa
30   4.7          3.2         1.6          0.2         setosa
⋮    ⋮            ⋮           ⋮            ⋮           ⋮

Statistical description of the dataset's columns, similar to R's summary.


In [33]:
describe(iris)


SepalLength
Min      4.3
1st Qu.  5.1
Median   5.8
Mean     5.843333333333334
3rd Qu.  6.4
Max      7.9
NAs      0
NA%      0.0%

SepalWidth
Min      2.0
1st Qu.  2.8
Median   3.0
Mean     3.0573333333333332
3rd Qu.  3.3
Max      4.4
NAs      0
NA%      0.0%

PetalLength
Min      1.0
1st Qu.  1.6
Median   4.35
Mean     3.7579999999999996
3rd Qu.  5.1
Max      6.9
NAs      0
NA%      0.0%

PetalWidth
Min      0.1
1st Qu.  0.3
Median   1.3
Mean     1.1993333333333331
3rd Qu.  1.8
Max      2.5
NAs      0
NA%      0.0%

Species
Length  150
Type    UTF8String
NAs     0
NA%     0.0%
Unique  3
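
To make the numeric part of that summary concrete, here is a minimal sketch that computes the same kind of statistics with Julia's Statistics standard library. It only uses the SepalLength values of the nine rows printed by `head` earlier, so its numbers differ from the 150-row output above. The helper name `column_summary` and its field names are hypothetical and not part of DataFrames.

```julia
using Statistics  # standard library: mean, median, quantile

# SepalLength values of the nine rows printed by `head data/iris.csv`
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4]

# Hypothetical helper (not a DataFrames function): the numeric part
# of a describe-style summary, returned as a named tuple
function column_summary(v::AbstractVector{<:Real})
    return (lo  = minimum(v),
            q1  = quantile(v, 0.25),
            med = median(v),
            avg = mean(v),
            q3  = quantile(v, 0.75),
            hi  = maximum(v))
end

s = column_summary(sepal_length)

# Print one statistic per line, label first
for (name, value) in pairs(s)
    println(rpad(string(name), 6), value)
end
```

The quartiles here come from `quantile` with its default settings; whether the notebook's `describe` uses exactly the same interpolation rule is an assumption.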


In [34]:
using Gadfly # Similar to R's ggplot2

In [35]:
plot(iris, x="Species", y="PetalLength", color="Species", Geom.boxplot)


Out[35]:
[Figure: box plot of PetalLength (y-axis) by Species (x-axis: setosa, versicolor, virginica), colored by Species]

In [36]:
plot(iris, color="Species", x="PetalLength", Geom.histogram)


Out[36]:
[Figure: histogram of PetalLength (x-axis), colored by Species (setosa, versicolor, virginica)]

In [37]:
plot(iris, x=:PetalLength, y=:PetalWidth, color=:Species, Geom.point, Geom.smooth(method=:lm))


Out[37]:
[Figure: scatter plot of PetalWidth (y-axis) against PetalLength (x-axis), colored by Species (setosa, versicolor, virginica), with Geom.smooth(method=:lm) lines]

In [38]:
# Pkg.add("GLM")

In [39]:
using GLM # Generalized linear models

linear = fit(LinearModel, PetalWidth ~ PetalLength, iris) # PetalLength coefficient in R: 0.4157554


Out[39]:
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.DensePredQR{Float64}},Float64}:

Coefficients:
              Estimate  Std.Error  t value Pr(>|t|)
(Intercept)  -0.363076   0.039762 -9.13122   <1e-15
PetalLength   0.415755 0.00958244  43.3872   <1e-85
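
As a rough illustration of what such a fit computes, the sketch below solves the same kind of least-squares problem with plain Julia, without GLM. It uses only the nine rows printed by `head` earlier (all setosa), so the coefficients it prints are not the ones in the table above. The variable names are illustrative, and this is not a description of how GLM is implemented internally.

```julia
using LinearAlgebra  # dense least-squares methods for `\`
using Statistics     # cov, var, mean

# PetalLength (x) and PetalWidth (y) for the nine rows printed by `head data/iris.csv`
x = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4]
y = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2]

# Design matrix: a column of ones for the intercept, then the predictor
X = hcat(ones(length(x)), x)

# For a tall matrix, `\` returns the least-squares solution
intercept, slope = X \ y

# Cross-check against the closed form for simple linear regression
slope_closed = cov(x, y) / var(x)
intercept_closed = mean(y) - slope_closed * mean(x)

println("intercept = ", intercept, ", slope = ", slope)
println("closed form: intercept = ", intercept_closed, ", slope = ", slope_closed)
```

Both routes should agree up to floating-point rounding; the closed form only applies to the one-predictor case, while the design-matrix form extends to more columns.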

In [40]:
using Clustering

In [41]:
# kmeans expects a d×n matrix (features in rows, observations in columns), hence the transpose
cl = kmeans(convert(Matrix{Float64}, iris[:, [:PetalWidth, :PetalLength]])', 3)


Out[41]:
Clustering.KmeansResult{Float64}(2x3 Array{Float64,2}:
 2.0375   0.246  1.34231
 5.59583  1.462  4.26923,[2,2,2,2,2,2,2,2,2,2  …  1,1,1,1,1,1,1,1,1,1],[0.00596,0.00596,0.02836,0.00356,0.00596,0.08036,0.00676,0.00356,0.00596,0.02276  …  0.131424,0.314757,0.264757,0.161424,0.224757,0.22559,0.373924,0.15809,0.107257,0.302257],[48,50,52],[48.0,50.0,52.0],31.371358974358984,5,true)

In [42]:
cl.centers


Out[42]:
2x3 Array{Float64,2}:
 2.0375   0.246  1.34231
 5.59583  1.462  4.26923
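
To show how these centers relate to the assignments and costs printed in Out[41], here is a minimal sketch of the nearest-center step in plain Julia. It takes the centers from Out[42] and the first four rows of iris, and assumes the cost of a point is its squared Euclidean distance to the assigned center. The helper `nearest_center` is hypothetical and not part of Clustering.

```julia
# Cluster centers copied from Out[42]: one column per cluster,
# rows are PetalWidth and PetalLength
centers = [2.0375   0.246  1.34231;
           5.59583  1.462  4.26923]

# First four rows of iris in the same layout
# (features in rows, observations in columns)
points = [0.2 0.2 0.2 0.2;
          1.4 1.4 1.3 1.5]

# Hypothetical helper: index of the nearest center and the
# squared Euclidean distance to it
function nearest_center(p, centers)
    d2 = [sum(abs2, p .- centers[:, k]) for k in 1:size(centers, 2)]
    k = argmin(d2)
    return k, d2[k]
end

results = [nearest_center(points[:, j], centers) for j in 1:size(points, 2)]
assignments = first.(results)
costs = last.(results)

println(assignments)               # cluster index per point
println(round.(costs, digits=5))   # squared distance per point
```

For these four rows the sketch gives cluster 2 each time, with costs of about 0.00596, 0.00596, 0.02836 and 0.00356, which agrees with the first entries of the assignment and cost vectors in Out[41].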

In [43]:
by(iris, :Species, df -> (mean(df[:PetalWidth]), mean(df[:PetalLength])))


Out[43]:
Row  Species     x1
1    setosa      (0.24600000000000002,1.462)
2    versicolor  (1.3259999999999998,4.260000000000001)
3    virginica   (2.026,5.5520000000000005)
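
The per-species means in Out[43] can be set side by side with the cluster centers in Out[42]. The sketch below, in plain Julia, pairs each species mean with its closest center and reports the Euclidean distance between them. The values are copied from the two outputs (means rounded to the printed precision), and the variable names are illustrative.

```julia
# Per-species means from Out[43], as [PetalWidth, PetalLength]
species_means = ["setosa"     => [0.246, 1.462],
                 "versicolor" => [1.326, 4.26],
                 "virginica"  => [2.026, 5.552]]

# Cluster centers from Out[42]: one column per cluster
centers = [2.0375   0.246  1.34231;
           5.59583  1.462  4.26923]

# For each species, find the closest center and the distance to it
matches = Dict{String,Int}()
for (name, m) in species_means
    d = [sqrt(sum(abs2, m .- centers[:, k])) for k in 1:size(centers, 2)]
    k = argmin(d)
    matches[name] = k
    println(name, " -> cluster ", k, " (distance ", round(d[k], digits=3), ")")
end
```

With these numbers setosa pairs with cluster 2, versicolor with cluster 3 and virginica with cluster 1. The setosa mean coincides with its center, which is consistent with that cluster containing exactly 50 points in Out[41].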